COMMUNICATION-OPTIMAL PARALLEL AND SEQUENTIAL 
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GREY BALLARD t, JAMES DEMMEL *, OLGA HOLTZ §, AND ODED SCHWARTZ 1 

Abstract. Numerical algorithms have two kinds of costs: arithmetic and communication, by 
which we mean either moving data between levels of a memory hierarchy (in the sequential case) or 
over a network connecting processors (in the parallel case). Communication costs often dominate 
arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper 
we first extend known lower bounds on the communication cost (both for bandwidth and for latency) 
of conventional (0(n 3 )) matrix multiplication to Cholesky factorization, which is used for solving 
dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky 
decomposition implementations to these lower bounds and identify the algorithms and data structures 
that attain them. In the sequential case, we consider both the two-level and hierarchical memory 
models. Combined with prior results in [131 1141 115] . this gives a set of communication-optimal 
algorithms for 0(n 3 ) implementations of the three basic factorizations of dense linear algebra: LU 
with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing 
communication for any number of levels of memory hierarchy. 

Key words. Cholesky decomposition, bandwidth, latency, communication avoiding, algorithm, 
lower bound. 

1. Introduction. Let A be a real symmetric and positive definite matrix. Then 
there exists a real lower triangular matrix L so that A = L ■ L T (L is unique if we 
restrict its diagonal elements to be positive). This is called the Cholesky decomposi- 
tion. We are interested in finding efficient parallel and sequential algorithms for the 
Cholesky decomposition. Efficiency is measured both by the number of arithmetic 
operations and by the amount of communication, either between levels of a memory 
hierarchy on a sequential machine, or between processors communicating over a net- 
work on a parallel machine. Since the time to move one word of data typically exceeds 
the time to perform one arithmetic operation by a large and growing factor 18 , our 
goal will be to minimize communication. 

1.1. Communication model. We model communication costs in more detail 
as follows. In the sequential case, with two levels of memory hierarchy (fast and slow), 
communication means reading data items (words) from slow memory to fast memory 
and writing data from fast memory to slow memory. Words that are contiguous can be 
read or written in a bundle which we will call a message. We assume that a message 
of n words can be communicated between fast and slow memory in time a + f3n 
where a is the latency (seconds per message) and /? is the inverse bandwidth (seconds 
per word). We define the bandwidth cost of an algorithm to be the total number of 
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words communicated and the latency cost of an algorithm to be the total number of 
messages communicated. We assume that the matrix being factored initially resides 
in slow memory, and is too large to fit in the smaller fast memory. Our goal then is to 
minimize the total number of words and the total number of messages communicated 
between fast and slow memoryj^] 

In the parallel case, we are interested in the communication among the processors. 
As in the sequential case, we assume that a message of n consecutively stored words 
can be communicated in time a + fin. This cost includes the time required to "pack" 
non-contiguous words into a single message, if necessary. We assume that the matrix 
is initially evenly distributed among all P processors, and that there is only enough 
memory to store about 1/P-th of a matrix per processor. As before, our goal is to 
minimize the number of words and messages communicated. In order to measure the 
communication complexity of a parallel algorithm, we will count the number of words 
and messages communicated along the critical path of the algorithm. That is, if many 
pairs of processors send identical messages simultaneously, the cost along the critical 
path of the algorithm is that of one message (we do not consider communication 
resource contention among processors). 

1.2. Communication Lower Bounds. We consider classical algorithms for 
Cholesky decomposition, i.e., those that perform "the usual" 0(n 3 ) arithmetic oper- 
ations, possibly reordered by associativity and commutativity of addition. That is, 
our results do not apply when using distributivity to reorganize the algorithm (such 
as Strassen-like algorithms); we also assume no pivoting is performed. We define 
"classical" more carefully later. We show that the communication complexity of any 
such Cholesky algorithm shares essentially the same lower bound as does the classical 
matrix multiplication. 

Theorem 1.1 (Main Theorem). Any sequential or parallel classical algorithm 
for the Cholesky decomposition of n-by-n matrices can be transformed into a classical 
algorithm for matrix multiplication, in such a way that the bandwidth cost of 

the matrix multiplication algorithm is at most a constant times the bandwidth cost of 
the Cholesky algorithm. 

Therefore any bandwidth cost lower bound for classical matrix multiplication 
applies to classical Cholesky decomposition, in a Big-0 sense. In particular, since a 
sequential classical n-by-n matrix multiplication algorithm has a bandwidth cost lower 
bound of fl(n 3 /M 1 / 2 ) where M is the fast memory size [UJ 123 , classical Cholesky 
decomposition has the same lower bound (we discuss the parallel case later). 

To get a latency cost lower bound, we use the simple observation [T3] that the 
number of messages is at least the bandwidth cost lower bound divided by the maxi- 
mum message size, and that the maximum message size is at most fast memory size 
in the sequential case (or the local memory size in the parallel case). So for sequential 
matrix multiplication this means the latency cost lower bound is i7(n 3 /M 3 / 2 ). 

1.3. Communication Upper Bounds. 

1.3.1. Two-level memory model. In the case of matrix multiplication on a 
sequential machine, a well-known algorithm attains the bandwidth cost lower bound 



lr The sequential communication model used here is sometimes called the two-level I/O model or 
disk access machine (DAM) model (see 2 !) 12 ). Our bandwidth cost model follows that of |23| and 
|25| in that it assumes the block-transfer size is one word of data (B = 1 in the common notation). 
However, our model allows message sizes to vary from one word up to the maximum number of words 
that can fit in fast memory. 
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|23j . It computes C = A ■ B by performing blocked matrix multiplication that mul- 
tiplies and accumulates y^^-by-y 1 ^ submatrices of A, B and C. If each matrix is 

stored so that the 4p entries of each of its submatrices are contiguous (not the case 
with columnwise or rowwise storage) the latency cost lower bound is reached; we call 
such a data structure block- contiguous storage and describe it in more detail below. 
Alternatively, one could try to copy A and B from their input format (say columnwise) 
to contiguous block storage doing (asymptotically) no more communication than the 
subsequent matrix multiplication; we will sec this is possible provided M = f2(n). 
There will be analogous requirements for Cholesky to attain its latency cost lower 
bound. 

In particular, we draw the following conclusions about the communication costs 
of sequential classical Cholesky decomposition in the two-level memory model, as 
summarized in Table ITTT1 

1. "Naive" sequential variants of Cholesky that operate on single rows and 
columns (be they left-looking, right-looking, etc.) attain neither the band- 
width nor the latency cost lower bound. 

2. A sequential blocked algorithm used in LAPACK (with the correct block size) 
attains the bandwidth cost lower bound, as do the recursive algorithms in 
IH [20l [2TJ [27] . A recursive algorithm analogous to Toledo's LU algorithm [28] 
attains the bandwidth cost lower bound in nearly all cases, expect possibly 
for an O(logn) factor in the narrow range lo ^ 2 - < M < n 2 . 

3. Whether the LAPACK algorithm also attains the latency cost lower bound 
depends on the matrix layout: If the input matrix is given in row-wise or 
column- wise format, and this is not changed by the algorithm, then the la- 
tency cost lower bound cannot be attained. But if the input matrix is given 
in contiguous block storage, or M = fi(n) so that it can be copied quickly to 
contiguous block format, then the latency cost lower bound can be attained 
by the LAPACK algorithm. Toledo's algorithm cannot minimize latency (at 
least when M > n 2 / 3 ) See also Conclusion [H] below. 

1.3.2. Hierarchical memory model. Since most computers have multiple lev- 
els (e.g., LI, L2, L3 caches, main memory, and disk), we also consider the commu- 
nication costs in the hierarchical memory model. In this case, an optimal algorithm 
should simultaneously minimize communication between all pairs of adjacent levels of 
memory hierarchy (e.g., minimize bandwidth and latency costs between LI and L2, 
between L2 and L3, etc.). 

In the case of sequential matrix multiplication, bandwidth cost is minimized in 
this sense by simply applying the usual blocked algorithm recursively, where each level 
of recursion multiplies matrices that fit at a particular level of the memory hierarchy, 
by using the blocked algorithm to multiply submatrices that fit in the next smaller 
level. This is easy since matrix multiplication naturally breaks into smaller matrix 
multiplications. 

For matrix multiplication to minimize latency across all memory hierarchy levels, 
it is necessary for all submatrices of all sizes to be stored contiguously. This leads to 
a data structure variously referred to as recursive block storage, Morton ordering, or 
storage using space-filling curves, and described in [31 El HE1 HZ1 US] • 

Finally, sequential matrix multiplication can achieve communication optimality 
as just described in one of two ways: (1) We can choose the number of recursion levels 
and sizes of the subblocks with prior knowledge of the number of levels and sizes of 



4 



BALLARD, DEMMEL, HOLTZ, AND SCHWARTZ 



the levels of memory hierarchy, a cache-aware process called tuning. (2) We carry out 
the recursion down to 1-by-l blocks (or some other small constant size), repeatedly 
dividing block sizes by 2 (perhaps padding submatrices to have even dimensions as 
needed). Such an algorithm is called cache- oblivious |17j . and has the advantage of 
simplicity and portability compared to a cache-aware algorithm, though it might also 
have more overhead in practice. 

It is indeed possible for sequential Cholesky to be organized to be optimal across 
multiple memory hierarchy levels in all the senses just described, assuming we use 
recursive block storage: 

4. The recursive algorithm modeled on Toledo's LU can be implemented in a 
cache-oblivious way so as to minimize bandwidth cost, but not latency costj^] 

5. The cache-oblivious recursive Cholesky algorithm which first appeared in [20] 
(and can also be found in (3] 0J [21j [27] ) minimizes both bandwidth and latency 
costs (assuming recursive block storage) for all matrices across all memory 
hierarchy levels. None of the other algorithms have this property. 

1.3.3. Parallel model. Finally, we address the case of parallel Cholesky, where 
there are P processors connected by a network with latency a and reciprocal band- 
width f3. We consider here only the memory-scalable case, where each processor's 
local memory is of size M = 0(n 2 /P), so that only 0(1) copies of the matrix are 
stored overall (the so-called "2D case"). See [35] for the general lower bound, including 
3D, for matrix multiplication. A 3D algorithm for LU decomposition without pivot- 
ing (along with an algorithm for triangular solve) is presented with communication 
analysis in [24j . While the algorithms minimize the total volume of communication, 
they do not minimize communication costs along the critical path as measured in this 
paper. 

For the 2D case, the consequence of our Main Theorem is again a bandwidth 
cost lower bound of the form Q(n 3 /(PM 1 ' 2 )) = Q(n 2 / P 1 / 2 ), and a latency cost lower 
bound of the form ft(n 3 /(PM 3 / 2 )) = ^(P 1 / 2 ). 

6. The Cholesky algorithm in ScaLAPACK [IT] attains a matching upper 
bound. It does so by partitioning the matrix into submatrices and distribut- 
ing them to the processors in a block cyclic manner. With the right choice 

of block size b, namely b — f J=Y it attains the above bandwidth and 

latency cost lower bounds to within a factor of log P. This is summarized in 
Table |L2] See §[33T]for details. 

The rest of this paper is organized as follows. In Sj2]we show the reduction from 
matrix multiplication to Cholesky decomposition, thus extending the bandwidth cost 
lower bounds of [53] and [35] to a bandwidth cost lower bound for the sequential 
and parallel implementations of Cholesky decomposition. We also discuss latency 
cost lower bounds. In Sj3]we present known Cholesky decomposition algorithms and 
compare their bandwidth and latency costs with the lower bounds. 

A shorter version of this paper (an extended abstract) appeared in the proceedings 
of the SPAA 2009 conference; the shorter version includes the reduction proof of the 
lower bounds and Tables [Ll] and [131 This paper proves the claims made in the tables 
by presenting the algorithms and their analyses in Ej3j 

2 Toledo's algorithm is designed to retain numerical stability for the LU case. The algorithm 
of 20 deals with the Cholesky case only and therefore requires no pivoting for numerical stability. 
Thus a simpler recursion suffices, and the latency cost diminishes. 
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Table 1.1 

Sequential bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of 
the fast memory. We assume n 2 > M in this table. FLOPs count of all is 0(n 3 ). Refer to ip| for 
definitions of terms and details. 
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Table 1.2 

Parallel bandwidth and latency costs: lower bound vs. algorithms. M denotes the size of the 
memory of each processor. P is the number of processors, b is the block size. Refer to j|3] for 
definitions of terms and details. 



2. Communication Lower Bounds. Consider an algorithm for a parallel com- 
puter with P processors that multiplies matrices in the "classical" way (the usual 
0(n 3 ) arithmetic operations possibly reordered using associativity and commutativ- 
ity of addition) and each of the processors has memory of size M. Irony et al. [25] 
showed that at least one of the processors has to send or receive this minimal number 
of words: 

Theorem 2.1 f |25j). Any "classical" implementation of matrix multiplication of 
nxn matrices A and B on a P processor machine, each equipped with memory of size 

3 

M , requires that one of the processors sends or receives at least 1 — M words. 

2 V / 2PM2 

If A and B are of size nxm and mxr respectively, then the corresponding bound 

IS ^rm_. _ M . 
2 v / 2PAf 2 

As any processor has memory of size M, any message it sends or receives may 
deliver at most M words. Therefore we deduce the following: 

Corollary 2.2. Any "classical" implementation of matrix multiplication of nxn 
matrices A and B on a P processor machine, each equipped with memory of size M , 
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requires that one of the processors sends or receives at least — - — 3 — 1 messages. 

2 V / 2PM2 

If A and B are of size nxm and mxr respectively, then the corresponding bound 

is nmr , - 1. 
2V2PM2 

For the case of P = 1 these coincide with the lower bounds for the bandwidth 
and the latency costs of the sequential case. The lower bound on bandwidth cost 
of sequential matrix multiplication was previously shown (up to some multiplicative 
constant factor) by Hong and Kung [23 . 

One means of extending these lower bounds to more algorithms is by reduction. 
It is easy to reduce matrix multiplication to LU decomposition of a slightly larger 
order, as the following identity shows: 





This identity means that LU factorization without pivoting can be used to perform 
matrix multiplication; to accommodate pivoting A and/or B can be scaled down 
to be too small to be chosen as pivots, and A ■ B can be scaled up accordingly. 
Thus an 0(n 3 ) implementation of LU decomposition that only uses associativity and 
commutativity of addition to reorganize its operations (thus eliminating Strassen- 
like algorithms) must perform at least as much communication as a correspondingly 
reorganized implementation of 0(n 3 ) matrix multiplication. 

We wish to mimic this lower bound construction for Cholesky. Consider the 
following reduction from matrix multiplication to Cholesky decomposition. Let T be 
the matrix defined below, composed of 9 square blocks each of dimension n; then the 
Cholesky decomposition of T is: 



T=\ A I + A-A T \ = \ A I • / A-B \ =L-L T 




I A 1 -B\ I I \ / 1 A 

1 

\~B T D ) \-B T (A ■ B) T X) 

where X is the Cholesky factor of D' = D - B T B - B T A T AB, and D can be any 
symmetric matrix such that D' is positive definite. 

Thus A ■ B is computed via this Cholesky decomposition. Intuitively this seems to 
show that the communication complexity needed for computing matrix multiplication 
is a lower bound to that of computing the Cholesky decomposition (of matrices three 
times larger) as A ■ B appears in L T , the decomposition of T. Note however that 
A ■ A T also appears in T . Conceivably, computing A ■ B from A and B incurs less 
communication cost if we are also given A-A T , so the above is not a sufficient reduction 
from matrix multiplication to Cholesky decomposition]^] Let us instead consider the 
following approach to prove the lower bound. 

In addition to the real numbers K, consider new "starred" numerical quantities, 
called 1* and 0*, with arithmetic properties detailed in the following tables. 1* and 
0* mask any real value in addition/substraction operation, but behave similarly to 
1 £ R and £ M in multiplication and division operations. 

Consider this set of values and arithmetic operations. 
• The set is commutative with respect to addition and to multiplication (by 
the symmetries of the corresponding tables). 



3 Note that computing A ■ A T is asymptotically as hard as matrix multiplication: take A = 
[A,0;y T ,0]. Then A- A T = [*,A:y;*,*] 
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Table 2.1 

Arithmetic Operations: x,y stand for any real values. For consistency, 



I 0* and —1* 



• The set is associative with respect to addition: regardless of ordering of sum- 
mation, the sum is 1* if one of the summands is 1*, otherwise it is 0* if one 
of the summands is 0* . 

• The set is also associative with respect to multiplication: (a ■ b) ■ c — a ■ (b ■ c). 
This is trivial if all factors are in K. As 1* is a multiplicative identity, it is 
also immediate if some of the factors equal 1*. Otherwise, at least one of the 
factors is 0*, and the product is 0. 

• Distributivity, however, does not hold: 1 ■ (1* + 1*) = 1 ^ 2 = (1 ■ 1*) + (1 • 1*) 
Let us return to the construction. We set T' to be: 




where C has 1* on the diagonal and 0* everywhere else: 

/l* 0* ••• 0*\ 

0* 1* 0* : 

C = 

: 0* 

\0* ■■■ 0* I* J 

One can verify that the (unique) Cholesky decomposition of C i^] 



C = 



(1* 

0* 

Vo* 



0* 





l*J 



fl* 0* 



0*\ 





iV 



= c c 



IT 



(2.1) 



Note that if a matrix X does not contain any "starred" values 0* and 1* then 
X = C X = X C = C'-X = X-C = C lT ■ X = X ■ C' T and C + X = C. Therefore, 



4 By writing X ■ Y we mean the resulting matrix assuming the straightforward n 3 matrix multi- 
plication algorithm. This has to be stated clearly, as the distributivity does not hold for the starred 
values. 
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one can confirm that the Cholesky decomposition of T' is: 

(I A T -B\ (I \ (I A T -B\ 

T' = [ A C = [A C ■ C' T A ■ B ) = L ■ L T (2.2) 

\~B T C) \-B T (A ■ B) T C'J \ C' T ) 

One can think of C as masking the A ■ A T previously appearing in the central 
block of T, therefore allowing the lower bound of computing A ■ B to be accounted 
for by the Cholesky decomposition, and not by the computation of A ■ A T . More 
formally, let Alg be any classical algorithm for Cholesky factorization. We convert it 
to a matrix multiplication algorithm as follows: 

Algorithm 1 Matrix Multiplication by Cholesky Decomposition 
Input: Two nxn matrices, A and B. 
1: Let Alg' be Alg updated to correctly handle the new 0*,1* values, {note that 
Alg' can be constructed off-line.} 



Construct T' as in Equation (2.2) 
L = Alg'{T) 
return (L 32 ) T 



The simplest conceptual way to do step (1) is to attach an extra bit to every 
numerical value, indicating whether it is "starred" or not, and modify every arith- 
metic operation to check this bit before performing an operation. This increases the 
bandwidth cost by at most a constant factor. Alternatively, we can use Signalling 
NaNs as defined in the IEEE Floating Point Standard [I] to encode 1* and 0* with 
no extra bits. 

If the instructions implementing Cholesky are scheduled deterministically, there is 
another alternative. One can run the algorithm "symbolically" , propagating 0* and 1* 
arguments from the inputs forward, simplifying or eliminating arithmetic operations 
whose inputs contain 0* or 1*. One can also eliminate operations for which there 
is no path in the directed acyclic graph (describing how outputs of each operation 
propagate to inputs of other operations) to the desired output A ■ B. The resulting 
Alg' performs a strict subset of the arithmetic and memory operations of the original 
Cholesky algorithm. 

We note that updating Alg to form Alg' is done off-line, so that step (1) does not 
actually take any time to perform when Algorithm [l] is called. 

Classical Cholesky Decomposition Algorithms. We next verify the correctness of 
this reduction: that the output of this procedure on input A, B is indeed the mul- 
tiplication A ■ B, as long as Alg is a classical algorithm, in a sense we now define 
carefully. 

Let T' — L- L T be the Cholesky decomposition of T' . Then we have the following 
formulas: 



L(i,i)= T'(i,i)- Y, ( L M)) 5 



k£[i-l] 



L(i,j) 



i T'M- E L(i,k)-L{j,k)),i>j 



(2.3) 



(2.4) 
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where [t] = {1, t}. A "classical" Cholesky decomposition algorithm computes each 
of these 0(n 3 ) flops, which may be reordered using only commutativity and asso- 
ciativity of addition. By the no-pivoting and no-distributivity restrictions on Alg, 
when an entry of L is computed, all the entries on which it depends have already 
been computed and combined by the above formulas, with the sums occurring in any 
order. These dependencies form a dependency graph on the entries of L, and impose 
a partial ordering on the computation of the entries of L (see Figure 2.1). That is, 
when an entry L(i,i) is computed, by Equation (2.3), all the entries {L(i, k)} ke ^_^ 



have already been computed^] Denote this set of entries by S^j, namely 



Si ti = {L(i, k)} ke [i-i] 



(2.5) 



Similarly, when an entry L(i,j) (for i > j) is computed, by Equation (2.4), all the 
entries {L(i, k)} k £[j-i\ and all the entries {L(j, k)}k^[j] have already been computed. 
Denote this set by Sij namely, 



J i,3 



{L(i, fc)}fce[j-i] U {L(j, k)} ke[j] 



(2.6) 




Fig. 2.1. Dependencies of L(i,j), for diagonal entries (left) and other entries (right). 
Dark grey represents the sets Si^ (left) and Si j (right). Light grey represents indirect dependencies. 



Lemma 2.3. Any ordering of the computation of the elements of L that respects 
the partial ordering induced by the above mentioned directed acyclic graph results in 
a correct computation of A - B. 

Proof. We need to confirm that the starred entries 1* and 0* of T' do not somehow 
"contaminate" the desired entries of Lj 2 - The proof is by induction on the partial 
order on pairs implied by (2.5) and (2.6). The base case — the correctness of 

computing L(l, 1) — is immediate. Assume by induction that all elements of Sij are 
correctly computed and consider the computation of L(i,j) according to the block in 
which it resides: 

• If L(i,j) resides in block Ln, L21 or L31 then Si.j contains only real val- 
ues, and no arithmetic operations with 0* or 1* occur (recall Figure 2.1 or 
Equations (2. 2), (2. 5) and (2.6)). Therefore, the correctness follows from the 



correctness of the original Cholesky decomposition algorithm. 
If L(i, j) resides in L 2 2 or L 33 then S^j may contain "starred" value (elements 
of C"). We treat separately the case where L(i,j) is on the diagonal and the 
case where it is not. 



5 While this partial ordering constrains the scheduling of flops, it does not uniquely identify 
a computation DAG (directed acyclic graph), for the additions within one summation can be in 
arbitrary order (forming arbitrary subtrees in the computation DAG). 
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If i = j then by Equation ( 2.3 1 L(i, i) is determined to be 1* since T'{i, i) = V 



and since adding to, subtracting from and taking the square root of 1* all 



result in 1* (recall Table 2.1 and Equation (2.3)) 



If i > j then by the inductive assumption the divisor L(j,j) of Equation 



(2.4) is correctly computed to be 1* (recall Figure 2.1 and the definition of 
C in Equation ( |2.1[ )). Therefore, no division by 0* is performed. Moreover, 
T'(i,j) is 0*. Then L(i,j) is determined to be the correct value 0*, unless 



1* is subtracted (recall Equation (2.4)). However, every subtracted product 



(recall Equation ( |2.4[ )) is composed of two factors of the same column but of 
different rows. Therefore, by the structure of C , none of them is 1* so their 
product is not 1* and the value is computed correctly. 

If L{i, j) resides in L 32 then Si j may contain "starred" values (see Figure 



2.1 right-hand side, row j). However, every subtraction performed (recall 



Equation (2.4)) is composed of a product of two factors, of which one is on 
the ith row (and on a column k < j). Hence, by induction (on the 
(i, k) element has been computed correctly to be a real value, and by the 
multiplication properties so is the product. Therefore no masking occurs. 
This completes the proof of Lemma |2.3| □ 

Communication Analysis. We now know that Algorithm [T] correctly multiplies 
matrices "classically" , and so has known communication lower bounds given by The- 
orem 2.1 and Corollary 2.2 It remains to confirm that Step 2 (setting up T") and 
Step 4 (returning £32) do not require much communication, so that these lower bounds 
apply to Step 3, running Alg' (recall that Step 1 may be performed off-line and so 
doesn't count). Since Alg' is either a small modification of Cholesky to add "star" 
labels to all data items (at most doubling the bandwidth cost), or a subset of Cholesky 
with some operations omitted (those with starred arguments, or not leading to the 
desired output £32), a lower bound on communication for Alg' is also a lower bound 
for Cholesky. 

Theorem 2.4 (Main Theorem). Any sequential or parallel classical algorithm 
for the Cholesky decomposition of n-by-n matrices can be transformed into a classical 
algorithm for r ^-by-^ matrix-multiplication, in such a way that the bandwidth cost of 
the matrix-multiplication algorithm is at most a constant times the bandwidth cost of 
the Cholesky algorithm. 

Therefore any bandwidth or latency cost lower bound for classical matrix multi- 
plication applies to classical Cholesky, asymptotically speaking: 

Corollary 2.5. In the sequential case, with a fast memory of size M, the band- 
width cost lower bound for Cholesky decomposition is f2(n 3 /M 1 ' 2 ), and the latency 
cost lower bound is f2(n 3 /M 3 ^ 2 ). 

Proof. Constructing T' (in any data format) requires bandwidth of at most 18n 2 
(copying a 3n-by-3n matrix, with another factor of 2 if each entry has a flag indicating 
whether it is "starred" or not), and extracting L^ 2 requires another n 2 of bandwidth. 
Furthermore, we can assume n 2 < n 3 /M 1 / 2 , i.e., that M < n 2 , i.e., that the matrix is 
too large to fit entirely in fast memory (the only case of interest). Thus the bandwidth 
lower bound f2(n 3 /M 1 / 2 ) of Algorithm [I] dominates the bandwidth costs of Steps 2 
and 4, and so must apply to Step 3 (Cholesky). Finally, as each message delivers at 
most M words, the latency lower bound for Step 3 is by a factor of M smaller than 
its bandwidth cost lower bound, as desired. □ 

Corollary 2.6. In the parallel case, with a 2D layout on P processors, the 
bandwidth cost lower bound for Cholesky decomposition is Q(n 2 / 'P 1 ! 2 ), and the latency 
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cost lower bound is ^(P 1 / 2 ). 

Proof. The argument in the parallel case is analogous to that of Corollary |2.5| 
The construction of input and retrieval of output at Steps 2 and 4 of Algorithm 

jlj contribute bandwidth of O^-j^^j. Therefore the lower bound of the bandwidth 



fl j is determined by Step 3, the Cholesky decomposition. The lower bound 

on the latency of Step 3 is therefore £1 ^ pj ^ 3/2 ^ , as each message delivers at most M 
words. Plugging in M = 9 (^) yields B = ^(P 1 / 2 ). □ 



3. Upper Bounds. In this section we discuss known algorithms for Cholesky 
decomposition and their bandwidth and latency cost analyses, in the sequential two- 
level memory model, in the sequential hierarchical memory model, and in the parallel 
computing model. Throughout this section, we will use the notation B(n) and L(n) 
for bandwidth and latency costs, respectively, where n is the dimension of the square 
matrix. We note that these functions may also depend on the parameters M, the 
size of the fast memory, and P, the number of processors, but we omit its reference 
when it is clear from context. See also [TU] for latency analysis of two of the Cholesky 
decomposition algorithmsin the two-level (out-of-core) memory model. 

In §10.1.1 of [22], standard error analyses of Cholesky decomposition algorithm 



are given. These hold for any ordering of the summation of Equations (2.3) and (2.4), 
and therefore apply to all Cholesky decomposition algorithms below. 

3.1. Sequential Algorithms. 

3.1.1. Data Storage. Before reviewing the various sequential algorithms let 
us first consider the underlying data-structure which is used for storing the matrix. 
Although it does not affect the bandwidth cost, it may have a significant influence on 
the latency cost of the algorithm: it may or may not allow retrieving many relevant 
words using only a single message. As a simple example, consider computing the sum 
of n < M numbers in slow memory, which obviously requires reading these n words. 
If they are in consecutive memory locations, this can be done in one read operation, 
the minimum possible latency cost. But if they are not consecutive, say they are 
separated by at least M — 1 words, this may require n read operations, the maximum 
possible latency cost. 

The various methods for data storage (only some of which are implemented in 



LAP AC K) appear in Figure 3.1 We can partition these data structures into two 
classes: column-major and block-contiguous. 

Column-Major Storage. The column-major data structures store the matrix en- 
tries in column-wise order. This means that they are most fit for algorithms that 
access a column at a time (e.g., the left and right looking naive algorithms). How- 
ever, the latency cost of algorithms that access a block at time (e.g., LAPACK's 
implementation) will fail to achieve optimality, as retrieving a block of size 6x6 re- 
quires at least 6 messages, even if a single message can deliver 6 2 words. This means 
a possible increase in the latency cost by a factor of 6 (where b is typically the order 
of \/M). 

As the matrices of interest for Cholesky decomposition are symmetric, storing the 
entire matrix (known as 'Full storage') wastes space. About half of the space can be 
saved by using the 'old packed' or the 'rectangular full packed' storages. The latter 
has the advantage of a more uniform indexing, which allows faster addressing. 
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Block-Contiguous Storage. The block-contiguous data structures store the matrix 
entries in a way that allows a read or write of a block using a single message. This 
may reduce the latency cost of a Cholesky decomposition algorithm that accesses a 
bxb block at a time by a factor of b, compared with using column-major storage. We 
distinguish between two types of block-contiguous storage formats. The blocked data 
structure stores each block of the matrix in a contiguous space of the memory. For 
this, one has to know in advance the size of blocks to be used by the algorithm (which 
is a machine-specific parameter). 

The elements of each block may be stored as contiguous sub-blocks, where each 
sub-block is of a predefined size. The data structure may include several such layers of 
sub-blocks. This 'layered' data structure may fit the hierarchical memory model (see 



[3.2 for further discussion of this model). The next data structure allows the benefits 
of block-contiguous storage without knowledge of machine-specific parameters. 

The block-recursive format [51 [T5J Q~71 (also known as the bit interleaved layout, 
space-filling curve storage, or Morton ordering format) stores each of the four n/2xn/2 
submatrices contiguously, and the elements of each submatrix are ordered so that the 
smaller submatrices are each stored contiguously, and so on recursively. This format 
is 'cache-oblivious'. This means that it allows access to a single block using a single 
message (or a constant number of messages), without knowing in advance the size of 



the blocks (see Figure 3.1 1. Both the blocked and block-recursive formats have packed 
versions, where about half of the space is saved by using the symmetry of the matrices 
(see [15] for recursive full packed data structures). The algorithm in [3] uses a hybrid 
data structure in which only half the matrix is stored and recursive ordering is used on 
triangular sub-matrices and column-major ordering is used on square sub-matrices. 

Conversion on the fly. Since block-contiguous storage yields better latency cost 
than column-major storage for some algorithms, it can be worthwhile to convert a 
matrix stored in column-major order to the blocked data structure (with block size 
b = O(M)) before running the algorithm. This can be done by reading O(M) elements 
at a time, in a columnwise order (which requires one message) then writing each of 
these elements to the right location of the new matrix. We write these words using 
Q(yM) messages (one per each relevant block). Thus, the total number of messages is 
O (^jf) which is asymptotically dominated by O ( A " 3/2 ^ for M = f2(n). Converting 
back to column-major storage can be done in similar way (with the same asymptotic 
cost). 

Other Variants of these Data Structures. Similar to the column-major data struc- 
tures, there are row-major data structures that store the matrix entries row-wise. All 
the above-mentioned packed data structures have versions that are indexed to effi- 



ciently store the lower (as in Figure 3.1) and upper triangular part of a matrix. 

We consider the application of each of the algorithms below to column-major or 
block-contiguous data structures only. Adapting each of these algorithms to other 
compatible data structures (row-major vs. column-major, upper triangular vs. lower 
triangular, full storage vs. packed storage) is straightforward. 



3.1.2. Arithmetic Count of Sequential Algorithms. The arithmetic oper- 
ations of all the sequential algorithms considered below are exactly the same, up to 
reordering. The arithmetic count of all these algorithm is therefore the same, and is 
given only once. 
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By Equations ( 2.3 },( 2.4 1 the total arithmetic count is: 



i=l j=l 

3.1.3. Upper Bound for Matrix Multiplication. In this section we estab- 
lish the bandwidth and latency costs of matrix multiplication, which is used as a 
subroutine in some of the Cholesky decomposition algorithms. 

We recall the recursive matrix multiplication algorithm of Frigo, Leiserson, Prokop 
and Ramachandran |17j . Their algorithm, given as Algorithm [2j works by a divide- 
and-conquer approach, where at each step the algorithm splits the largest of three 
dimensions. 

Lemma 3.1 ([17]) . The bandwidth cost BMM{n,m,r) of multiplying two matrices 
of dimensions nxm and mxr is 



B MM (n,m,r) = 6 



nmr 



This result is stated slightly differently in [T7]. We obtain the form above by 
setting the cache- line length (or transfer block size) equal to 1. The latency cost of 
this algorithm varies according to the data structure used and the dimensions of the 
input and output matrices. 

Lemma 3.2. When using recursive contiguous-block data structure, the latency 
cost of matrix multiplication is 

/ nmr nm + mr + nr \ 
L MM (n, m,r) = e + j , 

and when using column-major or row-major layout, the latency cost is 

. ( nmr nm + mr + nr\ 
L MM (n, m, r) = & \ H I • 
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Algorithm 2 C = RMatMul(A, B): A Recursive Matrix Multiplication Algorithm 
Input: A,B, two matrices of dimensions nxm and mxr. 
Output: C, a matrix of dimension nxr, so that C = A ■ B 



\\ Partition A, B, C by dividing each dimension in half, e.g. A = 

if n = to = r = 1 then 

Cn = An ■ Bn 
else if n = max{n, to, r} then 

{Cxi Ci 2 ) = RMatMul ((A n A i2 ) , B) 
(C21 C22) = RMatMul ((A 2 i A 22 ) , B) 
else if m = max{n, to, r} then 
'An 

A 2 i 1 ' 1 °" in ~ 

'A i2 



An A 12 x 

^■21 ^22, 



C = RMatMul 



C = C + RMatMul 
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9: else 

C n 
C21, 
C12 

C 2 2y 

12: end if 
13: return C 



A 



(Bn Bi : 

, (B 2 i B 22/ 



RMatMul 
RMatMul 



B lx 
B 2 i 
Bi 2 
B 22 



Proof. Let m',n',r' be the dimensions of the problem the first time all three 
matrices fit into the fast memory. Then m',n',r' differ by at most a factor of 2, 
and thus are all Q(VM). Therefore, assuming the recursive contiguous-block data 
structure, each of the three matrices resides in 0(1) square blocks, and so reading 
and/or writing each matrix incurs a latency cost of 0(1) ■ Thus, the total latency cost 

is L 1Hf (n)=9(^). 

If the column-major or row-major data structures are used, then each such block 
of size 8(v / M)x8(-\/M) is read or written using Q(VM) messages (one for each of 



its rows or columns), and therefore the total latency cost is Lmm(ji) = 



^MM(n) 



We note that Algorithm [2] is cache-oblivious. That is, the algorithm achieves the 
bandwidth and latency costs given above with no knowledge of the size of the fast 
memory. The iterative blocked matrix multiplication algorithm can also achieve this 

performance, but the block size must be chosen to be 9 

3.1.4. Upper Bound for Triangular Solve. In this section we establish the 
bandwidth and latency costs of another subroutine used in some of the Cholesky 
decomposition algorithms, that of a triangular solve with multiple right hand sides 
(TRSM). Algorithm [3] is a recursive version of TRSM (a variation of this algorithm 
appears in [3], designed to utilize a non-recursive, tuned implementation of matrix 
multiplication). Below, we assume each matrix multiplication is implemented with 
the recursive algorithm (Algorithm [2]). 

Bandwidth Cost. As no communication is needed for sufficiently small matrices 
(other than reading the entire input, and writing the output), the bandwidth cost of 
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Algorithm 31 = RTRSM (A, U): Recursive Triangular Solver 
Input: A, U two nxn matrices, U is upper triangular. 
Output: A, so that X = A ■ U^ 1 

\\ Partition A, U,X by dividing each dimension in half, e.g. A — ( ^f 11 J \ 12 

\A 2 i A22 

1: if n — 1 then 

2: X = A/U 

3: else 

4: X u = RTRSM(A n ,U n ) 

5: X12 = RTRSM (A 12 - X u ■ U12, U22) 

6: X21 = RTRSM(A 21 ,U n ) 

7: X22 = RTRSM (A22 - X21 ■ C/12, U22) 

8: end if 

9: return X 



this algorithm is given by the recurrence 

r — ) ^ • Btrsm (f ) + 2 • Bmm(% ) ifn>W4p 

[ 3n otherwise 

where BMM( n ) = BMM{ n T n , n ) is the bandwidth cost of multiplying two n-by-n 
matrices. Assuming the matrix- multiplication is done using Algorithm [2j we have 

cost of recursive TRSM) is then 

Btrsm ( n ) = O ( + n 



BMAt(n) = O i^jf +n> ). The solution to this recurrence (and the total bandwidth 



Latency Cost. Assuming a block-contiguous data structure (recursive, or with the 
correct block size picked), the latency cost is given by the recurrence 

LtrsmH < { 4 ' L ™ (f ) + 2 ■ Lmm ^ [in >\ff 
[ 3 otherwise 

where LmmI") = = O ^ A " 3/2 ^ is the latency cost of recursive matrix 

multiplication. The solution to this recurrence is 



Ltrsm {n) = O 



( n 3 



Again, we note that Algorithm [3] is cache-oblivious. The iterative blocked TRSM 
algorithm can also achieve this performance, but the block size must be chosen to be 

e (Vm). 

Equipped with the communication costs of matrix multiplication and triangular 
solve, we proceed to the following subsections, where we present several Cholesky 
decomposition algorithms and provide the communication cost analyses which yield 
the first five enumerated conclusions in the introduction as well as the results shown 
in Table O 



16 



BALLARD, DEMMEL, HOLTZ, AND SCHWARTZ 



3.1.5. The Naive Left-Looking Cholesky Algorithm. We start with revis- 
iting the two naive algorithms (left-looking and right-looking), and see that both have 
non-optimal bandwidth and latency, as stated in Conclusion [T] of the introduction. 

The naive left-looking Cholesky decomposition algorithm is given in Algorithm [4] 
and Figure 3.2 Note that Algorithm [4] assumes that two columns (k and j) of the 
matrix can fit into fast memory simultaneously (i.e., M > In). 



Algorithm 4 Naive left-looking Cholesky algorithm 
l: for j = 1 to n do 

2: read A(j : n,j) from slow memory 

3: for k = 1 to j — 1 do 

4: read A(J : n, k) from slow memory 

5: update diagonal element: A(j,j) 4— A(j,j) — A(j,k) 2 

6: for i = j + 1 to n do 

7: update j th column element: A(i,j) <— A(i,j) — A(i, k)A(j, k) 

8: end for 
9: end for 

10: calculate final value of diagonal element: A(j,j) <— y/A(j, j) 
11: for i = j + 1 to n do 

12: calculate final value of j th column element: A(i,j) -s— A(i,j)/A(j,j) 
13: end for 

14: write A(j : n,j) to slow memory 
15: end for 



Bandwidth Cost. Assuming M > In, we follow Algorithm [4] and communication 
occurs at lines |2|4| and[l4j so the total number of words transferred between fast and 
slow memory while executing the entire algorithm is given by 



E 



'3-1 



2(n-i + l)+ ^(n-j + l) 



\k=l 



1 Q o 5 

= -n + n 2 + 



6 



6 



In the case when M < 2n, each column j is read into fast memory in segments 
of size M/2. For each segment of column j, the corresponding segments of previous 
columns k are read into fast memory in sequence to update the current segment. In 
this way, the total number of words transferred between fast and slow memory does 
not change. 

Latency Cost. Assuming M > In and the matrix is stored in column-major order, 
each column is contiguous in memory, so the total number of messages is given by 



E 



\k = \ / 



1 2 

= 2 n 



In the case when M < 2n, an entire column cannot be read/written in one 
message, so the column must be read/written in multiple messages of size O(M). 
Again, assuming the matrix is stored in column-major order, segments of columns 
will be contiguous in memory, and the latency is given by the bandwidth divided by 

the message size: O OmJ- 

The algorithm can be adjusted to work one row at a time (up-looking) if the 
matrix is stored in row-major order (with no change in bandwidth or latency), but 
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k j j k 



Fig. 3.2. Naive algorithms. Left: left-looking. Right: right-looking. Column j is currently 
being computed (both algorithms). Light grey is the area already computed. Column k is the column 
being read (left-looking) or written (right-looking). 



a block-contiguous storage format will increase the latency (columns would not be 



contiguous in memory in this case). This analysis, along with that of § 3.1.6 yields 
Conclusion [T] in the introduction. 

3.1.6. The Naive Right-Looking Cholesky Algorithm. The naive right- 
looking Cholesky decomposition algorithm is given in Algorithm [5] and Figure |3.2| 
Note that Algorithm [5] assumes that two columns (k and j) of the matrix can fit into 
fast memory simultaneously (i.e., M > 2n). 



Algorithm 5 Naive right-looking Cholesky algorithm 
l: for j = 1 to rt do 

2: read A(j : n,j) from slow memory 

3: calculate final value for diagonal element: A(j,j) <— y/A(j,j) 
4: for i = j + 1 to n do 

5: calculate final value for j th column element: A(i,j) <— A(i,j)/A(j,j) 
6: end for 

7: for k = j + 1 to n do 

8: read A(k : n, k) from slow memory 

9: for i = k to n do 

10: update k th column element: A(i, k) 4— A(i, k) — A(i,j)A(k,j) 

11: end for 

12: write A{k : n, k) to slow memory 
13: end for 

14: write A(j : n, j) to slow memory 
15: end for 



Bandwidth Cost. Assuming M > 2n, we follow Algorithm [5] and communication 
occurs at lines [2j [8j [l2j and[l4j so the total number of words transferred between fast 
and slow memory while executing the entire algorithm is given by 



E 



2(7i - j + 1) + J2 2 ( rl - k + 1) 



fe=i 



= -n 3 + n 2 + -i 
3 3 



In the case when M < 2n, more communication is required. In order to perform 
the column factorization, the column is read into fast memory in segments of size 
M — 1, updated with the diagonal element (which must remain in fast memory), 
and written back to slow memory. In order to update the trailing k th column, the 
k th column is read into fast memory in segments of size (M — l)/2 along with the 
corresponding segment of the j th column. In order to compute the update for the 
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entire column, the k th element of the j th column must remain in fast memory. After 
updating the segment of the k th column, it is written back to slow memory. Thus, 
the naive right-looking algorithm requires more reads from slow memory when two 
columns of the matrix cannot fit into fast memory, but this increases the bandwidth 
by only a constant factor. 

Latency Cost. Assuming M > In and the matrix is stored in column-major order, 
the total number of messages is given by 



E 




= n 2 + n. 



As in the case of the naive left-looking algorithm, when M < 2n, the size of a 
message is no longer the entire column. Assuming the matrix is stored in column- 
major order, message sizes are of the order equal to the size of fast memory. Thus, 
for 0(n 3 ) bandwidth, the latency is O Omj- 

This right-looking algorithm can be easily transformed into a down-looking row- 
wise algorithm in order to handle row-major data storages, with no change in band- 
width or latency. This analysis, along with that of § 3.1.5 yields Conclusion [I] in the 
introduction. 

3.1.7. LAPACICs POTRF. We next consider an implementation available in 
LAP AC K (see [5]) and show that it is bandwidth-optimal when using the right block 
size (as stated in Conclusion [2] of the introduction) and can also be made latency- 
optimal, assuming the correct data structure is used (as stated in Conclusion [3] of the 
introduction). 

LAPACLCs Cholesky algorithm, POTRF (Algorithm [6]) , is a blocked left-looking 
algorithm. For simplicity, we assume the block size 6 evenly divides the matrix size n. 
See Figure [373] for a diagram of the algorithm. Note that by the partitioning of line[2j 
A 2 i is 6x (j — 1)6, A 22 is bxb, A 3 i is (n/b — j)bx (j — 1)6, and A 32 is (n/b — j)bxb. 

Algorithm 6 LAPACK POTRF 
1: for j = 1 to n/b do 

2: partition matrix so that diagonal block is A 22 '- 



An 


* 


* 


Mi 


^22 


* 




A 32 


-433 



update diagonal block (SYRK): A 22 <- A 22 - Ai\A\\ 
factor diagonal block (j, j) (P0TF2): A 22 <- Chol(A 22 ) 
update column panel (GEMM): A 32 <- A 32 - A 31 A T 
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triangular solve for column panel (TRSM): A 32 A 32 A 22 J 
end for 



The communication costs of Algorithm [6] depend on the subroutines which per- 
form symmetric rank-6 update, matrix multiply, and triangular solve, respectively. For 
simplicity, we will assume that the symmetric rank-6 update is computed with the 
same bandwidth and latency as a general matrix multiply. Although general matrix 
multiply requires more communication, this assumption will not affect the asymptotic 
count of the communication required by the algorithm. We will also assume that the 
block size 6 is chosen so that three blocks can fit into fast memory simultaneously; 
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■«=> 



Fig. 3.3. LAPACK 's POTRF Cholesky Decomposition Algorithm 



that is, we assume that 



1 < b < 

In this case, the factorization of the diagonal block (line [4]) requires only 9(6 2 ) words 
to be transferred since the entire block fits into fast memory. Also, in the triangular 
solve for the column panel A32 (line [6]), the triangular matrix A22 is only one block, 
so the computation can be performed by reading the blocks of A 32 in sequence and 
updating them individually in fast memory. Thus, the amount of communication 
required by that subroutine is 9(6 2 ) times the number of blocks in the column panel. 
Finally, the dimensions of the matrices in the matrix multiply of line [3] during the j th 
iteration of the loop are 6x6, bx(j — 1)6, and (j — 1)6x6, and the dimensions of the 
matrices in the matrix multiply of line during the j th iteration are (n/b — j)bxb, 
(n/b — j)bx (j — 1)6, and (j - 1)6x6. 

Bandwidth Cost. Under the assumptions above, an upper bound on the number 
of words transferred between slow and fast memory while executing Algorithm [6] is 
given by 



%/b 

B(n) < [B MM (b, (j - 1)6, 6) + 9(6 2 ) + B MM ((n/b - j)b, (j - 1)6, 6) 
i=i 

+(n/6-j)e(6 2 ) 



where _Bj\/m(w, n, r) is the bandwidth cost required to execute a matrix multiplication 
of matrices of size mxn and nxr in order to update a matrix of size nxr. Since Bmm 
is nondecreasing in each of its variables, we have 

B{n)<™ \B MM (b, n, 6) + 9(6 2 ) + B MM (n, n, 6) + ^9(6 2 ) 
6 L 6 



< 



2B M Af(n,n,6) + 6(6 2 ) + -e(6 2 ; 





Assuming the matrix-multiply algorithm used by LAPACK achieves the same band- 



width cost as the one given in ^3.1.3 (the iterative blocked algorithm with the correct 
block size suffices), we have 



B M M{n,n,b) = 9 



n 2 6 
sfM 



nb 



Since 6 < vM and 6 < n, BMM(n,n,b) = 0(n 2 ). Thus, 

,3 



B(n) = O 
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Fig. 3.4. Rectangular Recursive Cholesky Algorithm. 



and choosing a block size b = 8(vM) gives 



B(n) = O ( -— +n 2 



and achieves the lower bound (as stated in Conclusion [2] of the introduction). Note 
that choosing a block size b — 1 reduces the blocked algorithm to the naive left-looking 
algorithm (see Algorithm [4| which has bandwidth cost 0(n 3 ). 

Latency Cost. Since all reads and writes are done block by block, the optimal 
latency cost 

L{n) = O 1 



is achieved if a block-contiguous data structure is used with block size of Q(VM). 
However, as the current implementations of LAP AC K use column-major data struc- 
tures, the latency cost in practice is 



L(n) = O - + 



M Jm 



As mentioned in § 3.1.1 in the case that M = Q(n), converting a matrix in column- 
major order to the blocked layout incurs a bandwidth cost of 0(n 2 ) and latency cost 

of O ^"3/2 ^ ■ Thus, in this case, the costs of conversion-on-the-fly and Algorithm j^J 
combined still achieve the communication lower bounds. These results are stated in 
Conclusion [3] of the introduction. 

3.1.8. Rectangular Recursive Cholesky Algorithm. The next recursive al- 
gorithm for Cholesky decomposition and its bandwidth analysis follow the recursive 
L [/-decomposition algorithm of Toledo (see [28 ). The algorithm given here (Algo- 
rithm [7| is in fact a simplified version of Toledo's: there is no pivoting, and L — U T . 
For completeness we repeat the bandwidth analysis and provide a latency analysis. 
The bandwidth proves to be optimal (as stated in Conclusion [2] of the introduction), 
but the latency does not (as stated in Conclusion [3] of the introduction). 

Bandwidth Cost. The analysis of the bandwidth cost follows that of Toledo j2S! • 
Following Algorithm [7] we either read/write one column (if n = 1) or make two 
recursive calls and perform one matrix multiplication. Therefore, the bandwidth cost 
is given by the recurrence 



tji x / B (m, I) + B MM (m 
S(TO ' n) = \2m 



|,|,|) +J B(m-f,|) ifn>l 

if n = 1 



COMMUNICATION-OPTIMAL CHOLESKY DECOMPOSITION 



21 



Algorithm 7 L = Rectangular RChol(A): Rectangular Recursive Cholesky Algo- 
rithm 



Input: A, an mxn section of positive semidcfinitc matrix (to > n). See Figure 3.4 

for block partitioning. 
Output: L, a lower triangular matrix, so that A = L ■ L T 



1 then 




{See f 3.1.3 



for RMatMul.} 



This recurrence assumes that no more than one column of the matrix can fit in 
fast memory (to < M). We note that Algorithm [7] communicates less than Toledo's 
algorithm, so the analysis of |28j provides an upper bound on the solution of this 
recurrence. One can use similar analysis to show in the case that to > M , B(m, n) = 

ft (j^jj + mnlognj , yielding the following lemma. 

Lemma 3.3 ( 28J ) . Given a matrix multiplication subroutine with bandwidth cost 
equivalent to Algorithm^ the bandwidth cost of the Rectangular Recursive Cholesky 
Algorithm on an mxn matrix (m > n) where at most one column of the matrix can 
fit into fast memory at a time (m> M) is 



B(m, n) = O I — ^= + ran log n 

Thus, the application of the algorithm to an n-by-n matrix (where n 2 > M) yields 
a bandwidth cost of 



Bin, n) = 6 —= + n log n 

2 

This is optimal, provided that M < 1q ^ 2 n , or, since any algorithm is bandwidth 
optimal when the whole matrix fits into fast memory, when n 2 < M. Hence the 
range of non-optimality is very small, so we consider Algorithm [7] to have optimal 
bandwidth cost. 

We note that in the case where to < M, the analysis is slightly different. In the 
case that multiple columns fit into fast memory, the bandwidth cost recurrence is 



B(m, n) 



B (to, I) + B MM {m - a, f, f ) +B (m - |, f ) if mn > M 
O(mn) if mn < M 



where the base case arises when the subproblem fits entirely in fast memory (which 
may be before Algorithm [7] reaches its base case) . The solution of this recurrence is 
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Fig. 3.5. Square Recursive Cholesky Algorithm. 



B(m, n) = 8 {^jj + mnlog ,J f^j , which is smaller than the cost in the case m > 
M. Because this difference is so small, we omit the distinction in Conclusion [2] and 
Table O 

Latency Cost. For the latency cost, we consider the column-major and block- 
recursive data storage formats. In order to show that this algorithm does not attain 
optimal latency cost, we provide lower bounds in each case. 

In the case of column-major storage, the latency cost of the entire algorithm is 
bounded below by the multiplication of Step [5] at the top of the recursion tree. This 
is a square multiplication of size ^, so the latency cost with column- major storage 

is fl ^17^ (from £ 3.1.3). Thus, using column-major storage, Algorithm |?j can never 
achieve optimal latency cost. 

In the case of block-recursive storage, the latency of the entire algorithm is 
bounded below by the latency required to resolve the base cases of the recursion. 
Assuming m > M, the base case occurs when n = 1, and the algorithm factors a 
single column. However, the block-recursive data structure does not store columns 
contiguously (up to 2 elements of the same column could be stored consecutively, 
depending on the recursive pattern), so reading a column requires Q(n) messages. 
Since a base case is reached for each column, the latency cost contribution of all the 
base cases is 0(n 2 ), a lower bound for the total latency cost of Algorithm [7J Since n 2 
asymptotically dominates j^w/s f° r M > n 2 ^ 3 , this algorithm cannot achieve optimal 
latency cost in this case either. 

We note that Algorithm [7] is cache-oblivious. 

3.1.9. Square Recursive Cholesky Algorithm. The next recursive algo- 
rithm is a natural adaptation of the rectangular recursive algorithm of § |3.1.8| Instead 
of dividing the matrix only vertically, because no pivoting is required, we can also di- 
vide horizontally, yielding a square recursive algorithm (Algorithm [8]). The algorithm 
has been considered before, but no asymptotic analysis was given prior to this work. 
We believe the algorithm first appears in [5D] and can also be found in [31 01 US [37] . 
Ahmed and Pingali |3 suggest the algorithm (though without asymptotic analysis of 
the bandwidth and latency costs). In [27] Simecek and Tvrdik give a detailed prob- 
abilistic cache-misses performance; however, no asymptotic claims on the worst-case 
bandwidth and latency are stated. We note that the algorithms in [4] [21] achieve 
the optimal bandwidth cost, but they assume data storages that prevent achieving 
optimal latency cost (neither the latency nor the bandwidth cost is explicitly given in 
those papers). 

The analysis in this section proves that Algorithm [8] minimizes the bandwidth 
cost (as stated in Conclusion [2] in the introduction) and, if the block-recursive data 
storage is used, the latency cost as well. 
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Algorithm 8 L = SquareRChol(A): Square Recursive Cholesky Algorithm 
Input: A, an nxn semidefinite matrix. See Figure [3~5] for block partitioning. 
Output: L, a lower triangular matrix, so that A = L ■ L T 
if n = 1 then 

else 

Ln = SquareRChol(An) 

L 21 = RTRSM(A 21 ,Lj 1 ) {See ^A^foiRTRSM .} 
A 22 = A 22 - RMatMul(L 21 , L 21 ) {See j j3.1.3| for RMatMul.} 
L 22 = SquareRChol(A 22 ) 
end if 
return L 



Bandwidth Cost. As no communication is needed for sufficiently small matrices 
(other than reading the entire input and writing the output), the bandwidth cost of 
this algorithm is given by the recurrence 



B(n) 



2 • B (f ) + B TRSM (f ) + B MM (f ) i£n>^Jf 
2n 2 otherwise 



Assuming the matrix multiplication is performed using Algorithm [2j and the trian- 
gular solve is performed using Algorithm |3| the solution to this recurrence (and the 
total bandwidth cost of Algorithm 1st) is 



B(n) = O -== + n 
1 fM 



n ' 



Latency Cost. The recurrence for the latency cost is similar: 

trsm (f ) + L MM (|) if n > y 3 

otherwise, 

Assuming the matrix is stored using the block-recursive data structure and the recur- 
sive subroutines are used, the latency cost is given by 

3 




We note that like the Rectangular Recursive Algorithm, Algorithm [8] is cache- 
oblivious. 

3.2. Sequential Algorithm for more than Two Memory Levels. In real 
computers, there are usually more than two types of memory, and in fact there is 
a hierarchy of memories, ranging from a very fast small memory to the largest and 
slowest memory |4j. We next consider the lower and upper bounds of Cholesky de- 
composition assuming such a model of hierarchical memory machines. We observe 
that none of the known algorithms for Cholesky decomposition allows cache-oblivious 
achievement of optimal latency cost (Conclusion Bl of the introduction), except the 
Square Recursive Cholesky algorithm (Conclusion!^ of the introduction). 
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3.2.1. Lower bound. Explicit lower bounds have been derived for various prob- 
lems (including matrix multiplication) within the hierarchical memory model (see for 
example, 26J). Moreover, the known lower bound for the two-level memory model 
can be applied here, by considering any two consecutive levels of the hierarchy as the 
fast and slow memories, and treating all faster memories as part of the fast memory, 
and all the slower memories as part of the slow one. Assuming the number of levels is 
some constant, this may be the correct lower bound, up to a constant factor. For the 
Cholesky decomposition, this gives the following bandwidth and latency cost lower 
bounds: 

Corollary 3.4. LetAlg be a 'classical' Cholesky decomposition algorithm imple- 
mented on a machine with some constant d levels of memory, of size M\ < • • • < Mj, 
with inverse bandwidth fi\ < ■ ■ ■ < fid an d with latency ol\ < ■ ■ • < a.&. Then the 
bandwidth cost of Alg is 



3.2.2. Upper bounds revisited. An algorithm may fail to achieve optimal 
communication costs on a hierarchical memory model even if it achieves optimal per- 
formance in the two-level memory model. For example, an algorithm that includes a 
parameter (e.g., block size) which allows tuning for better communication performance 
may be harder or impossible to tune optimally in the hierarchical memory model. On 
the other hand, as argued in |17j . a cache-oblivious algorithm which achieves optimal 
communication costs in the two-level memory model will also perform optimally on 
a computer with multiple memory levels. We next revisit the communication perfor- 
mance of the Cholesky decomposition algorithms above in the context of hierarchical 
memory model. 

LAPACK 's PDTRF. In the two-level memory model, the LAP AC K implementa- 
tion can achieve optimal bandwidth cost, and if the block-contiguous data structure 
is used, it can also achieve optimal latency cost. To this end one must tune the block 
size parameter b of the algorithm (and the block size of the data structure). 

This has a drawback when applying the LAPACK algorithm to hierarchical 
memory machines: setting b to fit a smaller memory level results in inefficient band- 
width and latency costs in the higher levels. Setting b according to a larger memory 
level results in block I/O that is too large to fit the smaller levels. Either way, one 
cannot achieve optimal bandwidth and latency costs between every pair of levels in 
the memory hierarchy with only one tunable parameter. 

Rectangular Recursive Algorithm. Recall from § |3.1.8| that the recursive Cholesky 
decomposition algorithm adopted from [28) is cache-oblivious and achieves optimal 
bandwidth cost for practically all ranges of memory sizes, but it cannot guarantee 
optimal latency cost when M > n 2 / 3 . Thus, the algorithm may minimize the band- 
width cost in the hierarchical model, but if, for example, there is a memory level of 
size greater than n 2 / 3 (other than the slowest one holding the original input), then 
this algorithm yields suboptimal latency cost. 




(3.1) 



and the latency cost is 




(3.2) 
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Square Recursive Algorithm. Recall from § |3.1.9| that the Square Recursive algo- 
rithm minimizes both bandwidth and latency costs, and it is cache-oblivious. Thus, 
it achieves communication optimality in the hierarchical memory model and is the 
only algorithm to do so. 

3.3. 2D Parallel Algorithms. Let us now consider the so-called 2D parallel 
algorithms, namely those that assume M = (jjr) local memory size and start with 

the n 2 matrix elements spread across the processors (i.e, no repetition of elements) . 

In the parallel model, communication occurs between pairs of processors, and 
n words sent as one message can be communicated in time a + fin. Unlike the 
sequential model, we do not consider the data structure used to store the parts of 
the matrix local to each processor. We assume that the cost of "packing" a message 
into a contiguous block of memory is included in the time a + fin. We count the 
communication cost along the critical path of the algorithm; that is, if many pairs of 
processors are communicating identical messages simultaneously, the cost along the 
critical path is that of only one message. 

We also consider the cost of a "broadcast," where one processor sends the same 
message to many other processors. Broadcasts can be implemented in many different 
ways, and implementations are often limited by the network topology. Following [llj . 
we assume a broadcast of one message incurs a latency cost of a ■ log P. That is, if 
the root processor sends the message to two processors, and each of those processors 
forward the message to two more, the information can reach P processors after log P 
steps. This implementation requires a topology that includes a binary tree of connec- 
tions among the P processors (a 2D torus or nearest-neighbor connection topology 
would not suffice, for example). 

Assuming a sufficient network topology, we show upper bounds for bandwidth 
and latency costs which match the lower bounds up to a logarithmic factor (as stated 
in Conclusion [6] of the introduction) . 

3.3.1. The ScaLAPACK Implementation. The ScaLAPACK [TT] routine 
PxPOTRF computes the Cholesky decomposition of a symmetric positive definite ma- 
trix A of size nxn distributed over a grid of P processors. The matrix is distributed 
block-cyclically, but only half of the matrix is referenced or overwritten (we assume 
the lower half here). See Algorithm [9] and Figure 3.6 We assume a network topology 



such that each processor column and each processor row are connected via a binary 
tree. 

We assume that the block dimension is bxb where n/b is an integer and the 
processors are organized in a square grid (P r = P c = vP) where n/y/P is also an 
integer. After computing the Cholesky decomposition of a diagonal block (j, j) locally, 
the result must be communicated to all processors which own blocks in the column 
panel below the diagonal block (i.e., blocks (i,j) for j < i < n/b) in order to update 
those blocks. Since the matrix is distributed block-cyclically, there can be at most v P 
such processors. After the column panel is updated using a triangular solve, those 
results must be communicated to other processors in order to perform the rank-6 
updates to blocks in the trailing matrix. For a given block (k, I) in the trailing matrix 
A22, the update depends on the k th block of the column panel (block (k,j)) and the 
transpose of the I th block of the column panel (block Thus, after a processor 

computes an update to block in the column panel, it must broadcast the result 
to processors which own blocks in the i th row panel (P r = \[P different processors). 
Then, after the processor owning the diagonal block receives that update, it re- 
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Algorithm 9 ScaLAPACK PxPDTRF 
l: for j = 1 to n/b do 

2: processor owning block computes Cholesky decomposition 

3: broadcast result to processors down processor column 

4: parallel for each processor owning blocks in column panel j do 

5: update blocks in panel with triangular solve 

6: broadcast results across processor row 

7: end for 

8: parallel for each processor owning diagonal block (i, i) (i > j) do 
9: re-broadcast results down processor column 
10: end for 

11: parallel for each processor owning blocks in trailing matrix do 
12: update blocks with symmetric rank-6 update 
13: end for 
14: end for 



rz: 



It 



B 



it 



B 



11 



B 




Fig. 3.6. ScaLAPACK 's PxPOTRF. 
Left: Block-cyclic distribution of the matrix to the processors. Here n = 24, b = 4, P c = P r = 3. 
Right: Processor grid: information flow. Here P c = P r = 6 . 



broadcasts to processors which own blocks in the i th column panel (P c — \J~P different 
processors). 

The result of the Cholesky decomposition of the diagonal block is a lower triangu- 
lar matrix, so the number of words in each message of the broadcast down the column 
is only 6(6 + 1) /2. A broadcast to P processors requires O (logP) messages. Any pro- 
cessor which owns a block in the column panel below the diagonal block (j, j) will own 
^ft = -j-^= blocks. Such a processor computes the updates for all the blocks it owns, 
and then broadcasts all the results together (total of A= words) to \J~P processors 
(which is done using O (log P) messages) . Once a processor owning the corresponding 
diagonal blocks receives this message, it re-broadcasts the message down the column, 
requiring another O (log P) messages. Thus, the total number of messages along the 
critical path is 

n/b 

]TO(logP) = 0(^logP) 

+i 
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and the number of words along the critical path is 



0(b 2 log P) + 0[^ 7 = log P 

0=1 



P 



0[ [n6 + ^=]logP 



Thus, setting the block size at b = y^pj, the total number of messages required 

is O (^/PlogP^j and the total number of words is O {^jp logP^ . We note that by 
setting b = for example, the matrix is no longer stored block-cyclically. In- 
stead, each processor owns one large block of the matrix, and since the full matrix is 
stored, nearly half of the processors own blocks which are never referenced or updated. 
Further, parallelism is lost as the algorithm progresses across the column panels. How- 
ever, this does not cause an asymptotic degradation in the computation time of the 
algorithm. For each of the y/P column panels, there are three phases of computation: 
Cholesky decomposition of the diagonal block, triangular solve to update the column 
panel, and matrix multiply to update the trailing matrix. Thus, setting b — the 
computation cost of the algorithm (not including lower order terms) is 



1 / n \ f n \ „ / n 

staO + b0 +2 {tt 



10 n 3 

y ' p" 



Thus, in choosing a large block size to attain the latency cost lower bound, we do not 
sacrifice the algorithm's ability to meet the computational asymptotic lower bound]^] 
This analysis yields Conclusion [6] in the introduction. 

4. Conclusion. In this paper we extended known lower bounds on the communi- 
cation cost (both for bandwidth and for latency) of conventional (0(n 3 )) matrix mul- 
tiplication to Cholesky factorization, and we compared the cost of various Cholesky 
decomposition implementations to these lower bounds. We identified and analyzed 
existing algorithms that minimized communication for the two-level, hierarchical, and 
parallel memory models. We showed that the ScaLAPACK implementation mini- 
mized communication in the parallel model, up to an O(logP) factor. However, the 
optimal and cache-oblivious sequential algorithm is yet to be implemented in a pub- 
licly available library, such as LAPACK. Our six main conclusions detailing these 
results are listed in the introduction. 

A 'real' computer may be more complicated than any model we have discussed, 
with both parallelism and multiple levels of memory hierarchy (where each sequential 
processor making up a parallel computer has multiple levels of cache), or with multi- 
ple levels of parallelism (i.e., where each 'parallel processor' itself consists of multiple 
processors), etc. And it may be 'heterogenous', with functional units and communica- 
tion channels of greatly differing speeds. We leave more complicated memory models 
and lower and upper communication bounds on such processors for future work. 

Recently we extended lower bounds to other algorithms in 7J, including QR 
decomposition and algorithms for eigenvalue problems, and identified/developed op- 
timal 0(n 3 ) implementations . We further plan to extend results to Strassen's matrix 
multiplication and other "fast" algorithms. 



6 As noted by Laura Grigori, one can choose a slightly smaller block size and achieve a computa- 
tional cost of g while increasing the latency cost by only a poly logarithmic factor |19| . 
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